Description

BACKGROUND 

The parent application EP-A-0 207 665 relates to a conditional delayed
branching method and to an apparatus for implementing this delayed
branching.

The ability to make decisions by conditional branching is an essential
requirement for any computer system which performs useful work. The
decision to branch or not to branch may be based on one or more events.
These events, often referred to as conditions, include: positive,
negative or zero numbers, overflow, underflow, or carry from the last
arithmetic operation, even or odd parity, and many others. Conditional
branches are performed in digital computers by conditional branch
instructions. Conditional branch instructions may be used to construct
such high level programming constructs as loops and if-then-else
statements. Because the loops and if-then-else programming constructs
are so common, it is essential that the conditional branch instructions
which implement them execute as efficiently as possible.

A computer instruction is executed by performing one or more steps.
Typically, these steps are first to fetch the instruction pointed to by
a program counter, second to decode and perform the operation indicated
by the instruction and finally to save the results. A simple branch
instruction changes the contents of the program counter in order to
cause execution to "jump" to somewhere else in the program. In order to
speed up the execution of computer instructions, a technique of
executing more than one instruction at the same time, called pipelining,
was developed. Pipelining permits, for example, the central processing
unit (CPU) to fetch one instruction while executing another instruction
and while saving the results of a third instruction at the same time. In
pipelined computer architectures, branching is an expensive operation
because branch instructions may cause other instructions in the pipeline
to be held up pending the outcome of the branch instruction. When a
conditional branch instruction is executed with the condition true, it
causes the CPU to continue execution at a new address referred to as a
target address. Since instruction fetching is going on simultaneously
with instruction decoding and execution in a pipelined computer, the
computer has already fetched the instruction following the branch
instruction in the program. This is a different instruction than the
instruction at the target address. Therefore, the CPU must hold up the
instruction pipeline following the branch instruction until the outcome
of the branch instruction is known and the proper instruction fetched.
Computer designers have therefore attempted to maximize throughput by
minimizing the need to hold up the instruction pipeline.



In the prior art, several schemes have been used to avoid holding up the
instruction pipeline for conditional branches. First, some high
performance processors have used various branch prediction schemes to
guess whether the conditional branch will be taken or not. This approach
requires extensive hardware and is unacceptable in all but the highest
performance computers because of the expensive hardware required.
Second, other architectures have fetched both the instruction in the
program following the branch and the instruction at the branch target
address. This approach is unacceptable because it also requires
expensive hardware and additional memory accesses to always fetch both
instructions. Third, some architectures have a bit in the instruction to
tell the computer whether it is more probable for the instruction
following the branch or the instruction at the branch target address to
be executed. The computer then fetches the more probable instruction and
holds up the pipeline only if the guess is wrong. This approach requires
expensive hardware and, if the guess is wrong, causes additional time to
be spent backing up the pipeline and fetching the appropriate
instruction. Fourth, other architectures allow two bits which instruct
the CPU to always or never execute the instruction following the branch
instruction based on whether the branch is taken or not taken. This
architecture uses too many bits from the instruction, thereby reducing
the maximum range of the branch instruction. Finally, still other
architectures always execute the instruction in the program following
the branch instruction before taking or not taking the branch.

The technique of executing the instruction in the program following the
branch instruction is known as delayed branching. Delayed branching is
desirable since the instruction in the pipeline is always executed and
the pipeline is not held up. This occurs because delayed branching gives
the computer time to execute the branch instruction and compute the
address of the next instruction while executing the instruction in the
pipeline. Although this technique avoids holding up the instruction
pipeline, it may require placing a no operation instruction following
the branch instruction, which would not improve performance since the
additional memory access negates any improvement.

One software technique which takes advantage of delayed branching is
merger. Merger works with loop constructs where the loop branch
instruction is at the end of the loop. Merger takes advantage of delayed
branching by duplicating the first instruction of the loop following the
loop's branch instruction and making the branch target address the
second instruction of the loop. One potential problem with merger is
that on exit from the loop, the program does not necessarily want to
execute the delayed branch instruction again. This is a problem for
architectures which always use delayed branching.

When many prior art computer systems deter- 
mine that a branch is about to be executed, the com- 
puter systems hold up, or interlock, the instruction pi- 
peline. Interlocking the pipeline involves stopping the 
computer from fetching the next instruction and pre- 
venting the pipeline from advancing the execution of 
any of the instructions in the pipeline. Interlocking re- 
duces the performance increase gained by pipelining 
and therefore is to be avoided. 

In accordance with the invention as claimed in 
the parent application EP-A-0 207 665, a method and 
apparatus are provided for conditional delayed 
branching within a digital computer. 

The present invention provides a method of and an apparatus for
nullification of, for example, the delay slot instruction following the
branch instruction where the delay slot instruction cannot be used
efficiently.

DESCRIPTION OF DRAWINGS 

Figure 1 is a branch instruction.
Figure 2 illustrates a method of branching.
Figure 3 is a flow chart of the method of branching.
Figure 4 is a functional block diagram of an apparatus in accordance
with the preferred embodiment of the present invention.
Figure 5 is a timing state diagram of the apparatus in Figure 4.

DESCRIPTION OF THE PREFERRED EMBODIMENT

Figure 1 is a branch instruction. The branch instruction 501 consists of
32 bits of information used by the computer to execute the instruction.
This instruction combines the function of branching with the operation
of comparing two operands. The instruction 501 contains a six bit
operation code field 502, a five bit first source register address field
503, a five bit second source register address field 504, a three bit
condition code field 505, an eleven bit branch displacement field 506, a
one bit displacement field sign bit 508, and a nullify bit 507. The
operation code field 502 identifies this instruction as a compare and
branch instruction. The first and second source register address fields
503 and 504 identify the registers whose contents will be compared. The
branch displacement, which may be positive or negative, is determined by
fields 508 and 506. This displacement is used to calculate the target
address for the branch. The next instruction in the instruction pipeline
may be nullified according to the present invention by setting the
nullify bit 507.
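
The field widths listed above add up to exactly 32 bits, which the following minimal sketch illustrates by packing and unpacking such an instruction word. The field widths come from the text; the order of the fields within the word and the names `FIELDS`, `pack` and `unpack` are assumptions made purely for illustration, since the description does not give bit positions.

```python
# Field widths are taken from the description of instruction 501.
# ASSUMPTION: the left-to-right order of the fields within the word is
# hypothetical; the text does not specify bit positions.
FIELDS = [
    ("opcode", 6),    # operation code field 502
    ("src1", 5),      # first source register address field 503
    ("src2", 5),      # second source register address field 504
    ("cond", 3),      # condition code field 505
    ("disp", 11),     # branch displacement field 506
    ("sign", 1),      # displacement sign bit 508
    ("nullify", 1),   # nullify bit 507
]


def pack(**values):
    """Pack the named field values into one 32-bit instruction word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split a 32-bit instruction word back into its named fields."""
    fields = {}
    shift = sum(width for _, width in FIELDS)
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields
```

For example, `unpack(pack(opcode=32, src1=3, src2=7, cond=2, disp=100, sign=0, nullify=1))` returns the same seven values that were packed.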

In the present invention, the execution of the current instruction may
be nullified. The purpose of nullification is to make the instruction
appear as if it never existed in the pipeline even though the
instruction may have been fetched and its operation performed.
Nullification is accomplished by preventing that instruction from
changing any state of the CPU. To prevent changing the state of the
computer, the nullification process must prevent the writing of any
results of the nullified instruction to any registers or memory location
and prevent any side effects from occurring, for example, the generation
of interrupts caused by the nullified instruction. This is performed by
qualifying any write signals with the nullify signal generated in the
previous instruction, thus preventing the instruction from storing any
results of any calculation or otherwise changing the state of the
computer system. A simple way of qualifying the write signals of the
current instruction is by 'AND'ing the write signals with a retained
copy of the nullify signal generated in the previous instruction. The
nullify signal generated by an instruction may, for example, be saved in
the processor status word for use in the following instruction.
Nullification is a very useful technique because it permits an
instruction to be fetched into the pipeline without concern as to
whether a decision being made by another instruction in the pipeline may
cause this instruction not to be executed. The instruction simply
progresses through the pipeline until it comes time to store its results
and the instruction may then be nullified at the last minute with the
same effect as if the instruction never existed in the pipeline.
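
The write qualification described above can be sketched in software as follows. This is an illustrative model only, not the circuit of the preferred embodiment: the class name `TinyCpu`, the attribute `retained_nullify` and the `step` method are hypothetical. It assumes an active-high nullify signal, so the write signal is ANDed with the complement of the retained copy, and it assumes that a nullified instruction cannot itself request nullification of its successor because it must not change any state.

```python
class TinyCpu:
    """Toy model of qualifying write signals with a retained nullify signal."""

    def __init__(self, num_registers=32):
        self.registers = [0] * num_registers
        # Retained copy of the nullify signal generated by the previous
        # instruction (the text suggests the processor status word).
        self.retained_nullify = False

    def step(self, dest, value, nullify_next=False):
        """Run one instruction that wants to write `value` to register `dest`.

        Returns True if the write took effect, False if it was nullified.
        """
        write_signal = True
        # ASSUMPTION: active-high nullify, so AND with its complement.
        write_enable = write_signal and not self.retained_nullify
        if write_enable:
            self.registers[dest] = value
        # ASSUMPTION: a nullified instruction changes no state, so it
        # cannot set the nullify signal for the instruction after it.
        self.retained_nullify = nullify_next and write_enable
        return write_enable
```

If an instruction is stepped with `nullify_next=True`, the next call to `step` leaves the register file untouched and returns `False`; the call after that writes normally again.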

In a pipelined computer system there are two distinct concepts as to the
next instruction to be executed. The first concept is a time sequential
instruction, which is the next instruction in the instruction pipeline
after the current instruction. This instruction will be executed after
the current instruction and the results of the operation stored unless
nullified. The second concept is a space sequential instruction. This is
the instruction immediately following the current instruction in the
program. Generally, the space sequential instruction for the current
instruction will be the time sequential instruction. The exception to
the rule occurs with taken branch instructions, where the time
sequential instruction is the instruction at the target address which is
generally not the space sequential instruction of the branch
instruction.

The delay slot instruction is the time sequential instruction of a
branch instruction. Generally, the delay slot instruction will be the
space sequential instruction of the branch instruction. The exception to
this rule is the case of a branch following a branch instruction. For
this case, the delay slot instruction for the second branch instruction
will be the instruction at the target address of the first branch
instruction rather than the space sequential instruction of the second
branch instruction.

Unconditional branching clearly illustrates the concept of nullification
and the delay slot instruction. With the nullify bit off, the delay slot
instruction of the unconditional branch instruction is always executed.
This is equivalent to always using delayed branching. With the nullify
bit on, the delay slot instruction of the unconditional branch
instruction is always nullified. This is equivalent to never executing
the delay slot instruction.

Figure 2 illustrates a method of conditional 
branching. A computer practicing the method of Fig- 
ure 2 has a program 101 consisting of instructions 
100 including a conditional branch instruction 102. 
The space sequential instruction to the branch in- 
struction 102 is instruction 103. For a conditional 
branch instruction 102 with negative branch displace- 
ment, instruction 104 is at the target address. For a 
conditional branch instruction 102 with a positive 
branch displacement, instruction 105 is at the target 
address. The execution of the program is illustrated 
by graphs 110, 111, 112, 113, and 114. During normal 
execution, the program executes the current instruc- 
tion and then executes the space sequential instruc- 
tion to the current instruction. 

Graphs 110, 111 and 113 illustrate the operation of a branch instruction
with the nullify bit off. This corresponds to the 'never nullify' or
'always execute' case. The delay slot instruction following the branch
instruction is always executed regardless of whether the branch is taken
or not and whether it has a positive or negative displacement. When the
branch condition is false, execution continues with the space sequential
instruction 103 as shown in graph 110. When the branch condition is
true, the delay slot instruction is executed and then the instruction at
the target address is executed as shown in graph 111 for a negative
displacement branch and in graph 113 for a positive displacement branch.

Graphs 110, 111, 112 and 114 illustrate the operation of a branch
instruction with the nullify bit on. This corresponds to the 'sometimes
nullify' case as described below. With the nullify bit on, the delay
slot instruction may be nullified depending on the direction of the
branch and whether the condition determining whether the branch is taken
or not is true or false. Graphs 110 and 114 illustrate the operation of
the branch instruction when the condition triggering the branch is
false, causing the branch not to be taken. If the branch displacement is
positive, the delay slot instruction is executed as shown by graph 110.
If the branch displacement is negative, the delay slot instruction is
nullified as shown by graph 114. The dotted line in graphs 112 and 114
indicates that the delay slot instruction, although fetched, will be
nullified as if it never existed in the instruction pipeline.

Graphs 111 and 112 illustrate the operation of the branch instruction
with the nullify bit on when the condition triggering the branch is
true, causing the branch to be taken. If the branch displacement is
positive, the delay slot instruction is nullified as shown in graph 112
and execution continues at the target address. If the branch
displacement is negative, the delay slot instruction is executed as
shown in graph 111 before continuing at the target address.

Figure 3 is a flow chart of the method of branching. The graphs 111
through 114 may be more clearly understood by referring to the flow
chart. The first step is to determine whether the nullify bit is on. If
the nullify bit is off, then the delay slot instruction for the branch
instruction is always executed. This occurs whether or not the branch is
taken. If the nullify bit is on, then the delay slot instruction
following the branch is not executed unless the branch is taken and the
branch displacement is negative, or unless the branch is not taken and
the branch displacement is positive.
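
The decision made in the flow chart can be written as a small predicate. This is a sketch of the rule as stated in the text, not of the hardware; the function name `delay_slot_executes` and its parameters are hypothetical, and treating a zero displacement as positive is an assumption.

```python
def delay_slot_executes(nullify_bit, branch_taken, displacement):
    """Return True if the delay slot instruction is executed,
    False if it is nullified.

    ASSUMPTION: a displacement of zero is treated as positive (forward).
    """
    if not nullify_bit:
        # Nullify bit off: the delay slot instruction is always executed.
        return True
    backward = displacement < 0
    # Nullify bit on: execute only for a taken backward branch or a
    # not-taken forward branch; nullify in the other two cases.
    return branch_taken == backward
```

With the nullify bit on, the four combinations correspond to the graphs of Figure 2: taken and negative executes (graph 111), taken and positive nullifies (graph 112), not taken and positive executes (graph 110), not taken and negative nullifies (graph 114).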

The operation embodies a very simple but effective method of static
branch prediction which predicts whether the branch will be taken or
not, and therefore which instruction to fetch, based on how positive and
negative displacement branches are taken. Its effectiveness depends on
computer software following a set of software conventions in
implementing certain higher level program control constructs by means of
a conditional branch instruction. For example, a loop construct is
implemented by a backward conditional branch, so that a branch
instruction with a negative displacement will be taken frequently. In
fact, it will be taken N-1 out of N times for a loop that is executed N
times. Another example of the software conventions assumed is that an
if-then-else construct is implemented by a forward branch to the rarely
taken part, allowing the more frequently executed part to lie
immediately following the branch instruction in the not taken branch
path. For example, the forward branch may be to an error handling
routine which rarely gets executed in a normal program. The embodiment
of the present invention having a nullify bit generalizes and optimizes
the use of the delay slot instruction in conjunction with the static
branch prediction technique described above. With the nullify bit on, a
backward conditional branch that is taken or a forward conditional
branch that is not taken, being the tasks that are predicted to be
frequent by the static branch prediction technique, cause the delay slot
instruction to be executed. Hence, some useful instruction in the
frequent path may be executed as the delay slot instruction, for
example, as described in the merger technique above. With the nullify
bit on, a backward conditional branch that is not taken or a forward
conditional branch that is taken, being the tasks that are predicted to
be rare, cause the delay slot instruction to be nullified. Hence,
nullification which reduces performance occurs only in the rare case.
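
The interaction between merger and the nullify bit can be illustrated with a toy trace. The sketch below assumes a hypothetical two-instruction loop body, "A" followed by "B", in which merger has placed a copy of "A" in the delay slot of the backward loop branch and retargeted the branch at "B". The function name `run_merged_loop` and the instruction labels are inventions for this example.

```python
def run_merged_loop(iterations):
    """Trace a merged loop whose backward branch has the nullify bit on.

    Returns (trace, nullified), where trace lists the instructions whose
    results take effect and nullified counts the delay slot instructions
    that were fetched but nullified.
    """
    trace = ["A"]  # first instruction, executed once on loop entry
    nullified = 0
    for i in range(iterations):
        trace.append("B")
        branch_taken = i < iterations - 1  # taken N-1 times out of N
        if branch_taken:
            # Backward branch taken: the delay slot copy of "A" executes.
            trace.append("A")
        else:
            # Backward branch not taken (loop exit): the copy of "A" is
            # nullified, so it does not run one time too many.
            nullified += 1
    return trace, nullified
```

For a loop executed N times the trace is identical to that of the unmerged loop, N repetitions of "A" then "B", and only one delay slot instruction is nullified, on exit.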

With the nullify bit off, the delay slot instruction is always executed.
This corresponds to the case where an instruction common to both the
branch taken and the branch not taken paths can be designated as the
delay slot instruction.

Figure 4 is a functional block diagram of an apparatus in accordance
with the preferred embodiment of the present invention. The apparatus
contains six functional elements: an instruction memory 301, an optional
virtual address translation unit 302, an instruction unit 303, an
execution unit 304, an optional floating point unit 305 and an optional
register file 306. These functional elements are connected together
through five busses: a result bus 310, a first operand bus 311, a next
instruction bus 312, a second operand bus 313 and an address bus 314.
Only the execution unit 304 and the instruction unit 303 are involved in
performing the operation of the preferred embodiment of the present
invention. The execution unit generates and/or stores the conditions on
which the decision to branch or not to branch is made. The instruction
unit performs the branch by generating the address of the next
instruction to be fetched from the memory and provides means for storing
the address into the program counter. In the preferred embodiment of the
present invention, the memory unit is a high speed cache with speed on
the order of the logic used in the execution unit.

Figure 5 is a timing state diagram of the apparatus in Figure 4. The
timing diagram illustrates four stages involved in the execution of
instructions 401, 402, 403 and 404. Time line 460 is divided into stages
with the time progressing to the right. The four timing stages for each
instruction are: an instruction address generation stage 410, an
instruction fetch stage 411, an execute stage 412, and a write stage
413. The execution of instructions may be pipelined to any depth
desired. The preferred embodiment of the present invention contains a
four stage pipeline. As shown in Figure 5, four instructions are being
executed at any one time. At time 450, the write stage of instruction
401 is overlapped with the execution stage of instruction 402, the
instruction fetch stage of instruction 403 and the instruction address
generation stage of instruction 404. This means for a branch instruction
that the next instruction will have been fetched while the branch
instruction is in the execution stage. During the instruction address
generation stage, the address of the next instruction is calculated from
the program counter which contains the address of the next instruction
to be executed and is located in the instruction unit 303. During the
instruction fetch stage, the next instruction is fetched from the
instruction memory 301. This is performed by applying the contents of
the address calculated in the instruction address generation stage onto
the address bus 314 and transferring the contents of that address to the
next instruction bus 312 where it is decoded by the instruction unit.
The branch instruction may be combined with other operations, for
example, a compare operation, which would be also decoded and performed
at this time in the execution unit 304.
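
The overlap of stages shown in Figure 5 can be reproduced with a few lines of code. The sketch assumes that one instruction enters the pipeline per time step and that no stage is ever held up; the names `STAGES` and `stage_at` are hypothetical.

```python
# The four stages named in the text, in pipeline order.
STAGES = ["address generation", "fetch", "execute", "write"]


def stage_at(instruction_index, time_step):
    """Return the stage occupied by an instruction at a given time step.

    ASSUMPTION: instruction i enters address generation at time step i
    and advances one stage per step with no interlocks. Returns None if
    the instruction is not in the pipeline at that time.
    """
    position = time_step - instruction_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None
```

Numbering instructions 401 through 404 as 0 through 3, time step 3 plays the role of time 450: `[stage_at(i, 3) for i in range(4)]` gives write, execute, fetch and address generation, matching the overlap described above.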



In the execute stage 412, the branch instruction 
is performed. During the execute phase 412 both the 
target address of the branch instruction and the ad- 
dress of the space sequential instruction to the 
branch instruction are generated. At this time if the in- 
struction is combined with another operation, that op- 
eration is performed. At the end of the execution 
phase, one of the two addresses is transferred into 
the program counter. Which address to transfer to the 
program counter is determined by the condition stor- 
ed in the execution unit 304. During the write phase 
413, no operation occurs unless a result from a com- 
bined instruction needs to be stored. By performing 
all writing of any results to memory or registers and 
any side effects like interrupt acknowledgement 
caused by an instruction no earlier than stage 412 
and 413, this approach enables a simpler implemen- 
tation of the concept of nullifying an instruction which 
is always in the pipeline. 
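
The address selection made at the end of the execute stage can be sketched as follows. The text states only that both addresses are generated and that the stored condition selects one of them; the 4-byte instruction size, the sign-magnitude reading of fields 508 and 506, the scaling of the displacement by the instruction size and measuring it from the address of the branch itself are all assumptions made for illustration. The function name `select_next_address` is hypothetical.

```python
INSTRUCTION_BYTES = 4  # ASSUMPTION: 32-bit instructions, byte addressing


def select_next_address(branch_address, sign_bit, displacement, condition):
    """Choose the address transferred to the program counter.

    Both candidate addresses are computed, mirroring the execute stage,
    and the branch condition selects between them.
    """
    # Address of the space sequential instruction.
    sequential = branch_address + INSTRUCTION_BYTES
    # ASSUMPTION: sign bit 508 negates the magnitude held in field 506.
    offset = -displacement if sign_bit else displacement
    target = branch_address + offset * INSTRUCTION_BYTES
    return target if condition else sequential
```

Under these assumptions a branch at address 100 with magnitude 5 goes to 120 when taken forward, to 80 when taken backward, and to 104 when not taken.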



Claims 

1. A method of nullifying an instruction which performs an operation and
is capable of generating results, errors, traps and interrupts in a
pipelined computer system having memory, the method comprising:

fetching first and second instructions from memory into the instruction
pipeline;

performing the instructions indicated by the first and second
instructions; and

preventing any results, errors, traps or interrupts generated by
fetching or performing the operation indicated by the second instruction
from being stored in the computer system or affecting the operation of
the computer system,

characterised in that the first instruction has a nullification field
which is stored with the result of the operation indicated by the first
instruction, and that said results, errors, traps or interrupts are
prevented from being stored as a condition of the state of the
nullification field of the first instruction.

2. An apparatus for permitting in a computer system a first instruction,
having a nullify signal having either a true or a false state, to
nullify a second instruction dependent on the state of said nullify
signal, which with a write signal stores the results of the second
instruction into the computer or generates errors, traps, and interrupts
in the computer system, the apparatus comprising:

means for retaining the state of the nullify signal after the execution
of the first instruction; and

means for qualifying the write signal of the second instruction with the
retained state of the nullify signal in order to prevent results from
the execution of the second instruction from being stored in the
computer system or any errors, traps or interrupts from affecting the
operation of the computer system.



Claims (translated from the German)

1. A method of nullifying an instruction which performs an operation and
is capable of generating results, errors, traps and interrupts in a
computer system with memory that uses pipelining, the method comprising
the following steps:

fetching a first and a second instruction from the memory into the
instruction pipeline;

performing the instructions indicated by the first and the second
instruction; and

preventing any results, errors, traps or interrupts generated by
fetching or performing the operation indicated by the second instruction
from being stored in the computer system or affecting the operation of
the computer system,

characterised in that

the first instruction has a nullification field which is stored with the
result of the operation indicated by the first instruction, and

the results, errors, traps or interrupts are prevented from being stored
as a condition of the state of the nullification field of the first
instruction.

2. An apparatus for enabling a first instruction in a computer system,
having a nullify signal which has either a true or a false state, to
nullify a second instruction dependent on the state of the nullify
signal, with which a write signal stores the results of the second
instruction in the computer or generates errors, traps and interrupts in
the computer system, the apparatus comprising the following features:

a device for retaining the state of the nullify signal after the
execution of the first instruction; and

a device for qualifying the write signal of the second instruction with
the retained state of the nullify signal, in order to prevent results of
the execution of the second instruction from being stored in the
computer system, or errors, traps or interrupts from affecting the
operation of the computer system.



Claims (translated from the French)

1. A method of nullifying an instruction which performs an operation and
which is capable of generating results, errors, traps and interrupts in
a pipelined computer system having memory, the method comprising:

fetching the first and second instructions from the memory into the
instruction pipeline;

performing the instructions indicated by the first and second
instructions; and

preventing any results, errors, traps or interrupts generated by
fetching or performing the operation indicated by the second instruction
from being stored in the computer system or from modifying the operation
of the computer system,

characterised in that the first instruction has a nullification field
which is stored with the result of the operation indicated by the first
instruction, and in that the results, errors, traps or interrupts are
prevented from being stored as a condition of the state of the
nullification field of the first instruction.

2. An apparatus for permitting in a computer system a first instruction,
having a nullify signal having either a true or a false state, to
nullify a second instruction dependent on the state of said nullify
signal, with which a write signal stores the results of the second
instruction in the computer or generates errors, traps and interrupts in
the computer system, the apparatus comprising:

means for delaying the state of the nullify signal after the execution
of the first instruction; and

means for qualifying the write signal of the second instruction with the
delayed state of the nullify signal, in order to prevent the results
from the execution of the second instruction from being stored in the
computer system, or any errors, traps or interrupts from modifying the
operation of the computer system.
BIDIRECTIONAL BRANCH PREDICTION AND OPTIMIZATION



BACKGROUND 



The ability to make decisions by conditional branching is an essential
requirement for any computer system which performs useful work. The
decision to branch or not to branch may be based on one or more events.
These events, often referred to as conditions, include: positive,
negative or zero numbers, overflow, underflow, or carry from the last
arithmetic operation, even or odd parity, and many others. Conditional
branches are performed in digital computers by conditional branch
instructions. Conditional branch instructions may be used to construct
such high level programming constructs as loops and if-then-else
statements. Because the loops and if-then-else programming constructs
are so common, it is essential that the conditional branch instructions
which implement them execute as efficiently as possible.

A computer instruction is executed by performing one or more steps.
Typically, these steps are first to fetch the instruction pointed to by
a program counter, second to decode and perform the operation indicated
by the instruction and finally to save the results. A simple branch
instruction changes the contents of the program counter in order to
cause execution to "jump" to somewhere else in the program. In order to
speed up the execution of computer instructions, a technique of
executing more than one instruction at the same time, called pipelining,
was developed. Pipelining permits, for example, the central processing
unit (CPU) to fetch one instruction while executing another instruction
and while saving the results of a third instruction at the same time. In
pipelined computer architectures, branching is an expensive operation
because branch instructions may cause other instructions in the pipeline
to be held up pending the outcome of the branch instruction. When a
conditional branch instruction is executed with the condition true, it
causes the CPU to continue execution at a new address referred to as a
target address. Since instruction fetching is going on simultaneously
with instruction decoding and execution in a pipelined computer, the
computer has already fetched the instruction following the branch
instruction in the program. This is a different instruction than the
instruction at the target address. Therefore, the CPU must hold up the
instruction pipeline following the branch instruction until the outcome
of the branch instruction is known and the proper instruction fetched.
Computer designers have therefore attempted to maximize throughput by
minimizing the need to hold up the instruction pipeline.

In the prior art, several schemes have been 
used to avoid holding up the instruction pipeline for 
conditional branches. First, some high performance 
processors have used various branch prediction 
schemes to guess whether the conditional branch 
will be taken or not. This approach requires exten- 
sive hardware and is unacceptable in all but the 
highest performance computers because of the ex- 
pensive hardware required. Second, other architec- 
tures have fetched both the instruction in the pro- 
gram following the branch and the instruction at the 
branch target address. This approach is unaccep- 
table because it also requires expensive hardware 
and additional memory accesses to always fetch 
both instructions. Third, some architectures have a 
bit in the instruction to tell the computer whether it 
is more probable for the instruction following the 
branch or the instruction at the branch target ad- 
dress to be executed. The computer then fetches 
the more probable instruction and holds up the 
pipeline only if the guess is wrong. This approach
requires expensive hardware and, if the guess is
wrong, causes additional time to be spent backing
up the pipeline and fetching the appropriate instruction.
Fourth, other architectures allow two bits which 
instruct the CPU to always or never execute the 
instruction following the branch instruction based 
on whether the branch is taken or not taken. This 
architecture uses too many bits from the instruction 
thereby reducing the maximum range of the branch 
instruction. Finally, still other architectures always 
execute the instruction in the program following the 
branch instruction before taking or not taking the 
branch. 

The technique of executing the instruction in 
the program following the branch instruction is 
known as delayed branching. Delayed branching is 
desirable since the instruction in the pipeline is 
always executed and the pipeline is not held up. 
This occurs because delayed branching gives the 
computer time to execute the branch instruction 
and compute the address of the next instruction
while executing the instruction in the pipeline. Al- 
though this technique avoids holding up the in- 
struction pipeline, it may require placing a no op- 
eration instruction following the branch instruction, 
which would not improve performance since the 
additional memory access negates any improve- 
ment. 

One software technique which takes advantage 
of delayed branching is merger. Merger works with 
loop constructs where the loop branch instruction is 
at the end of the loop. Merger takes advantage of 
delayed branching by duplicating the first instruction
of the loop following the loop's branch instruction
and making the branch target address the
second instruction of the loop. One potential problem
with merger is that on exit from the loop, the
program does not necessarily want to execute the
duplicated instruction following the branch again.
This is a problem for architectures which always use
delayed branching.
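The merger technique and its loop-exit problem can be sketched as a small trace generator. This is only an illustration: the function name `run_merged`, the `"BRANCH"` placeholder, and the `nullify_on_exit` flag are hypothetical, and the loop is assumed to run its body at least once.

```python
def run_merged(body, iterations, nullify_on_exit=True):
    """Trace a loop after merger (hypothetical sketch).

    Layout: body..., BRANCH (targets the second loop instruction),
    then a duplicate of the first loop instruction in the delay slot.
    """
    if not body:
        raise ValueError("loop body must contain at least one instruction")
    program = list(body) + ["BRANCH", body[0]]
    branch_at = len(body)
    pc, passes, trace = 0, 0, []
    while pc < len(program):
        if pc == branch_at:
            passes += 1
            taken = passes < iterations  # falls through on the last pass
            if taken or not nullify_on_exit:
                trace.append(program[branch_at + 1])  # delay slot runs
            # Taken: continue at the second loop instruction.
            # Not taken: continue past the delay slot.
            pc = 1 if taken else branch_at + 2
        else:
            trace.append(program[pc])
            pc += 1
    return trace
```

With `nullify_on_exit=False` the trace gains one extra copy of the first loop instruction after the final pass, which is the unwanted execution described above for architectures that always use delayed branching.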

When many prior art computer systems determine
that a branch is about to be executed, the
computer systems hold up, or interlock, the instruction
pipeline. Interlocking the pipeline involves
stopping the computer from fetching the next instruction
and preventing the pipeline from advancing
the execution of any of the instructions in the
pipeline. Interlocking reduces the performance increase
gained by pipelining and therefore is to be
avoided.

What is needed is a method of conditional
branching which minimizes both the amount of hardware
required and the reduction in performance. The method
should take as few bits of the instruction as possible since
each bit taken effectively halves the maximum
range of the branch instruction.
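The cost of giving up a displacement bit can be made concrete with a short calculation. The sketch below assumes a two's-complement signed displacement, which the text does not specify; `branch_range` is a hypothetical helper name.

```python
def branch_range(displacement_bits):
    """Return (lowest, highest) displacement for a signed field.

    Assumes two's-complement encoding (an assumption of this sketch).
    """
    if displacement_bits < 1:
        raise ValueError("need at least one bit")
    half = 1 << (displacement_bits - 1)
    return -half, half - 1
```

Under that assumption a 12 bit field (eleven displacement bits plus a sign bit) spans -2048 to 2047, while an 11 bit field spans only -1024 to 1023.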



SUMMARY

In accordance with the preferred embodiment
of the present invention, a method and apparatus
are provided for conditional branching within a digital
computer. The preferred embodiment of the
present invention provides a branch instruction
which statically predicts whether the branch will be
taken or not taken based on the branch displacement.
The method uses delayed branching where
possible but also provides for nullification of the
delay slot instruction following the branch instruction
where the delay slot instruction cannot be used
efficiently.

The present invention is superior to the prior
art in several ways. First, the preferred embodiment
of the present invention is capable of a
branch frequently/branch rarely prediction for conditional
branch instructions based on the existing
sign bit of the branch displacement without requiring
any other bit in the instruction. Second, the
preferred embodiment of the present invention optimizes
the use of the instruction immediately following
the conditional branch, which reduces the
probability of holding up the instruction pipeline
and its resulting reduction in performance. Third,
the preferred embodiment of the present invention
nullifies the instruction following the branch only in
cases when the instruction cannot be used. Finally,
the preferred embodiment of the present invention
provides a more flexible and efficient nullification
scheme based on direction of branching rather
than always executing or never executing the
instruction following the branch.

DESCRIPTION OF DRAWINGS

Figure 1 is a branch instruction in accordance
with the preferred embodiment of the present
invention.

Figure 2 illustrates a method of branching in
accordance with the preferred embodiment of
the present invention.

Figure 3 is a flow chart of the method of branching.

Figure 4 is a functional block diagram of an
apparatus in accordance with the preferred
embodiment of the present invention.

Figure 5 is a timing state diagram of the apparatus
in Figure 4.

DESCRIPTION OF THE PREFERRED EMBODIMENT



Figure 1 is a branch instruction in accordance
with the preferred embodiment of the present
invention. The branch instruction 501 consists of 32
bits of information used by the computer to execute
the instruction. This instruction combines the
function of branching with the operation of comparing
two operands, although the present invention
could be implemented by a branch only instruction
as well. The instruction 501 contains a six bit
operation code field 502, a five bit first source
register address field 503, a five bit second source
register address field 504, a three bit condition
code field 505, an eleven bit branch displacement
field 506, a one bit displacement field sign bit
508, and a nullify bit 507. The operation code field
502 identifies this instruction as a compare and
branch instruction. The first and second source
register address fields 503 and 504 identify the
registers whose contents will be compared. The
branch displacement, which may be positive or
negative, is determined by fields 508 and 506. This
displacement is used to calculate the target address
for the branch. The next instruction in the
instruction pipeline may be nullified according to
the preferred embodiment of the present invention
by setting the nullify bit 507.
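The field widths listed above add up to exactly 32 bits, which a small encoder and decoder can confirm. The sketch below is illustrative only: the order in which the fields are packed into the word, and the names `FIELDS`, `encode` and `decode`, are assumptions of this example, since the text gives the widths but not the bit positions.

```python
# (name, width) pairs, listed from most significant to least significant.
# The widths come from the description; the ordering is hypothetical.
FIELDS = [
    ("opcode", 6),    # operation code field 502
    ("src1", 5),      # first source register address field 503
    ("src2", 5),      # second source register address field 504
    ("cond", 3),      # condition code field 505
    ("disp", 11),     # branch displacement field 506
    ("nullify", 1),   # nullify bit 507
    ("sign", 1),      # displacement sign bit 508
]

def encode(**values):
    """Pack the named field values into one 32 bit word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word

def decode(word):
    """Split a 32 bit word back into its named fields."""
    if not 0 <= word < (1 << 32):
        raise ValueError("word must fit in 32 bits")
    fields = {}
    shift = 32
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields
```

A round trip such as `decode(encode(opcode=32, src1=3, src2=7, cond=5, disp=100, nullify=1, sign=0))` returns the same field values that were packed.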

In the preferred embodiment of the present
invention, the execution of the current instruction
may be nullified. The purpose of nullification is to
make the instruction appear as if it never existed in
the pipeline even though the instruction may have
been fetched and its operation performed. Nullification
is accomplished by preventing that instruction
from changing any state of the CPU. To prevent
changing the state of the computer, the nullification 
process must prevent the writing of any results of 
the nullified instruction to any registers or memory 
location and prevent any side effects from occur- 
ring, for example, the generation of interrupts caus- 
ed by the nullified instruction. This is performed in 
the preferred embodiment by qualifying any write 
signals with the nullify signal generated in the 
previous instruction thus preventing the instruction 
from storing any results of any calculation or other- 
wise changing the state of the computer system. A 
simple way of qualifying the write signals of the 
current instruction is by 'AND'ing the write signals
with a retained copy of the nullify signal
generated in the previous instruction. The nullify
signal generated by an instruction may, for exam- 
ple, be saved in the processor status word for use 
in the following instruction. Nullification is a very 
useful technique because it permits an instruction 
to be fetched into the pipeline without concern as 
to whether a decision being made by another in- 
struction in the pipeline may cause this instruction 
not to be executed. The instruction simply pro- 
gresses through the pipeline until it comes time to 
store its results and the instruction may then be 
nullified at the last minute with the same effect as if 
the instruction never existed in the pipeline. 
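The write-qualification idea can be sketched in a few lines. This is a simplified model, not the circuit of the preferred embodiment: the class `Cpu`, its 32-entry register list, and the method `write_back` are hypothetical names, and the sketch ANDs each write with the complement of the retained nullify signal so that a true signal blocks the write. It also assumes that a nullified instruction cannot itself nullify its successor, on the reasoning that it must not change any state.

```python
class Cpu:
    """Toy model of write qualification by a retained nullify signal."""

    def __init__(self):
        self.registers = [0] * 32   # register count is an assumption
        self.nullify_next = False   # retained copy, e.g. in a status word

    def write_back(self, reg, value, write_enable=True, sets_nullify=False):
        """Attempt the write stage of one instruction.

        Returns True if the write actually changed machine state.
        """
        # Qualify the write signal with the retained nullify signal.
        qualified = write_enable and not self.nullify_next
        if qualified:
            self.registers[reg] = value
        # Assumption: a nullified instruction leaves no trace, so it
        # cannot set the nullify signal for the following instruction.
        self.nullify_next = sets_nullify and not self.nullify_next
        return qualified
```

In this model an instruction that sets the nullify signal lets its own result through, the following instruction runs to its write stage and is then discarded, and the instruction after that writes normally.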

In a pipelined computer system there are two 
distinct concepts as to the next instruction to be 
executed. The first concept is a time sequential 
instruction, which is the next instruction in the 
instruction pipeline after the current instruction. 
This instruction will be executed after the current 
instruction and the results of the operation stored 
unless nullified. The second concept is a space 
sequential instruction. This is the instruction imme- 
diately following the current instruction in the pro- 
gram. Generally, the space sequential instruction 
for the current instruction will be the time sequen- 
tial instruction. The exception to the rule occurs 
with taken branch instructions, where the time se- 
quential instruction is the instruction at the target 
address which is generally not the space sequen- 
tial instruction of the branch instruction. 

The delay slot instruction is the time sequential 
instruction of a branch instruction. Generally, the 
delay slot instruction will be the space sequential 
instruction of the branch instruction. The exception 
to this rule is the case of a branch following a 
branch instruction. For this case, the delay slot
instruction for the second branch instruction will be
the instruction at the target address of the first
branch instruction rather than the space sequential
instruction of the second branch instruction.

Unconditional branching in the preferred embodiment
of the present invention clearly illustrates
the concept of nullification and the delay slot
instruction. With the nullify bit off, the delay slot
instruction of the unconditional branch instruction is
always executed. This is equivalent to always using
delayed branching. With the nullify bit on, the delay
slot instruction of the unconditional branch
instruction is always nullified. This is equivalent to
never executing the delay slot instruction.

Figure 2 illustrates a method of conditional
branching in accordance with the preferred embodiment
of the present invention. A computer practicing
the method of Figure 2 has a program 101
consisting of instructions 100 including a conditional
branch instruction 102. The space sequential
instruction to the branch instruction 102 is instruction
103. For a conditional branch instruction 102
with a negative branch displacement, instruction 104
is at the target address. For a conditional branch
instruction 102 with a positive branch displacement,
instruction 105 is at the target address. The execution
of the program is illustrated by graphs 110,
111, 112, 113, and 114. During normal execution,
the program executes the current instruction and
then executes the space sequential instruction to
the current instruction.

Graphs 110, 111 and 113 illustrate the operation
of a branch instruction with the nullify bit off.
This corresponds to the 'never nullify' or 'always
execute' case. The delay slot instruction following
the branch instruction is always executed regardless
of whether the branch is taken or not and
whether it has a positive or negative displacement.
When the branch condition is false, execution continues
with the space sequential instruction 103 as
shown in graph 110. When the branch condition is
true, the delay slot instruction is executed and then
the instruction at the target address is executed as
shown in graph 111 for a negative displacement
branch and in graph 113 for a positive displacement
branch.

Graphs 110, 111, 112 and 114 illustrate the
operation of a branch instruction with the nullify bit
on. This corresponds to the 'sometimes nullify'
case as described below. With the nullify bit on,
the delay slot instruction may be nullified depending
on the direction of the branch and whether the
condition determining whether the branch is taken
or not is true or false. Graphs 110 and 114 illustrate
the operation of the branch instruction when the
condition triggering the branch is false, causing the
branch not to be taken. If the branch displacement
is positive, the delay slot instruction is executed as
shown by graph 110. If the branch displacement is
negative, the delay slot instruction is nullified as
shown by graph 114. The dotted line in graphs 112
and 114 indicates that the delay slot instruction,
although fetched, will be nullified as if it never
existed in the instruction pipeline.

Graphs 111 and 112 illustrate the operation of 
the branch instruction with the nullify bit on when 
the condition triggering the branch is true causing 
the branch to be taken. If the branch displacement 
is positive, the delay slot instruction is nullified as 
shown in graph 112 and execution continues at the 
target address. If the branch displacement is nega- 
tive, the delay slot instruction is executed as shown 
in graph 111 before continuing at the target ad- 
dress. 

Figure 3 is a flow chart of the method of 
branching. The graphs 111 through 114 may be 
more clearly understood by referring to the flow 
chart. The first step is to determine whether the 
nullify bit is on. If the nullify bit is off, then the 
delay slot instruction for the branch instruction is 
always executed. This occurs whether or not the 
branch is taken. If the nullify bit is on, then the 
delay slot instruction following the branch is not 
executed unless the branch is taken and the 
branch displacement is negative, or unless the 
branch is not taken and the branch displacement is 
positive. 
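The decision in the flow chart reduces to a single comparison, sketched below. The function name `delay_slot_executes` is hypothetical, and treating a zero displacement as positive (sign bit clear) is an assumption of this sketch.

```python
def delay_slot_executes(nullify_bit, taken, displacement):
    """Return True if the delay slot instruction is executed,
    False if it is nullified."""
    if not nullify_bit:
        # 'Never nullify': the delay slot always executes.
        return True
    backward = displacement < 0  # zero counts as forward (assumption)
    # Execute for a taken backward branch or an untaken forward branch,
    # nullify in the two remaining cases.
    return taken == backward
```

With the nullify bit on, the four combinations correspond to graph 110 (not taken, positive: executed), graph 114 (not taken, negative: nullified), graph 112 (taken, positive: nullified) and graph 111 (taken, negative: executed).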

The operation of the preferred embodiment of 
the present invention embodies a very simple but 
effective method of static branch prediction which 
predicts whether the branch will be taken or not, 
and therefore which instruction to fetch, based on 
how positive and negative displacement branches 
are taken. Its effectiveness depends on computer 
software following a set of software conventions in 
implementing certain higher level program control 
constructs by means of a conditional branch in- 
struction. For example, a loop construct is imple- 
mented by a backward conditional branch, so that 
a branch instruction with a negative displacement 
will be taken frequently. In fact, it will be taken N-1
out of N times for a loop that is executed N times.
Another example of the software conventions assumed
is that an if-then-else construct is implemented
by a forward branch to the rarely taken
part, allowing the more frequently executed part to 
lie immediately following the branch instruction in 
the not taken branch path. For example, the for- 
ward branch may be to an error handling routine 
which rarely gets executed in a normal program. In 
addition, the preferred embodiment of the present 
invention having a nullify bit generalizes and op- 
timizes the use of the delay slot instruction in 
conjunction with the static branch prediction tech- 
nique described above. With the nullify bit on, a 
backward conditional branch that is taken or a 
forward conditional branch that is not taken, being 
the tasks that are predicted to be frequent by the 
static branch prediction technique, cause the delay 
slot instruction to be executed. Hence, some useful
instruction in the frequent path may be executed as
the delay slot instruction, for example, as described
in the merger technique above. With the nullify bit
on, a backward conditional branch that is not taken
or a forward conditional branch that is taken, being
the tasks that are predicted to be rare, cause the
delay slot instruction to be nullified. Hence,
nullification, which reduces performance, occurs only in
the rare case.

With the nullify bit off, the delay slot instruction
is always executed. This corresponds to the case
where an instruction common to both the branch
taken and the branch not taken paths can be designated
as the delay slot instruction.

Figure 4 is a functional block diagram of an
apparatus in accordance with the preferred embodiment
of the present invention. The apparatus
contains six functional elements: an instruction
memory 301, an optional virtual address translation
unit 302, an instruction unit 303, an execution unit
304, an optional floating point unit 305 and an
optional register file 306. These functional elements
are connected together through five busses: a result
bus 310, a first operand bus 311, a next
instruction bus 312, a second operand bus 313 and
an address bus 314. Only the execution unit 304
and the instruction unit 303 are involved in performing
the operation of the preferred embodiment
of the present invention. The execution unit generates
and/or stores the conditions on which the
decision to branch or not to branch is made. The
instruction unit performs the branch by generating
the address of the next instruction to be fetched
from the memory and provides means for storing
the address into the program counter. In the preferred
embodiment of the present invention, the
memory unit is a high speed cache with speed on
the order of the logic used in the execution unit.

Figure 5 is a timing state diagram of the apparatus
in Figure 4. The timing diagram illustrates
four stages involved in the execution of instructions
401, 402, 403 and 404. Time line 460 is divided
into stages with the time progressing to the right.
The four timing stages for each instruction are: an
instruction address generation stage 410, an instruction
fetch stage 411, an execute stage 412,
and a write stage 413. The execution of instructions
may be pipelined to any depth desired. The preferred
embodiment of the present invention contains
a four stage pipeline. As shown in Figure 5,
four instructions are being executed at any one
time. At time 450, the write stage of instruction 401
is overlapped with the execution stage of instruction
402, the instruction fetch stage of instruction
403 and the instruction address generation stage of
instruction 404. This means for a branch instruction
that the next instruction will have been fetched while
the branch instruction is in the execution stage.
During the instruction address generation stage,
the address of the next instruction is calculated
from the program counter, which contains the address
of the next instruction to be executed and is
located in the instruction unit 303. During the
instruction fetch stage, the next instruction is fetched
from the instruction memory 301. This is performed
by applying the address calculated
in the instruction address generation stage onto the
address bus 314 and transferring the contents of
that address to the next instruction bus 312 where
it is decoded by the instruction unit. The branch
instruction may be combined with other operations,
for example, a compare operation, which would be
also decoded and performed at this time in the
execution unit 304.

In the execute stage 412, the branch instruction
is performed. During the execute phase 412 both 
the target address of the branch instruction and the 
address of the space sequential instruction to the 
branch instruction are generated. At this time if the 
instruction is combined with another operation, that 
operation is performed. At the end of the execution 
phase, one of the two addresses is transferred into 
the program counter. Which address to transfer to 
the program counter is determined by the condition 
stored in the execution unit 304. During the write 
phase 413, no operation occurs unless a result 
from a combined instruction needs to be stored. By 
performing all writing of any results to memory or 
registers and any side effects like interrupt ac- 
knowledgement caused by an instruction no earlier 
than stages 412 and 413, this approach enables a
simpler implementation of the concept of nullifying 
an instruction which is always in the pipeline. 
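The overlap of the four stages described for Figure 5 can be tabulated with a short helper. This sketch assumes one new instruction enters the pipeline per cycle with no interlocks; the names `STAGES` and `stage_of` and the zero-based cycle numbering are hypothetical.

```python
from typing import Optional

# Stage names follow stages 410 to 413 of the description.
STAGES = ["address generation", "fetch", "execute", "write"]

def stage_of(instruction_index: int, cycle: int) -> Optional[str]:
    """Stage occupied by an instruction in a given cycle.

    Instruction i is assumed to enter address generation in cycle i.
    Returns None if the instruction is not in the pipeline.
    """
    position = cycle - instruction_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None
```

Taking indices 0 to 3 to stand for instructions 401 to 404, cycle 3 plays the role of time 450: the first instruction is in its write stage while the second executes, the third is fetched, and the fourth has its address generated.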



Claims

1. A method for nullifying an instruction which
performs an operation and is capable of generating
results, errors, traps and interrupts in a pipelined
computer system having memory, the method
comprising:
fetching the instruction from memory into the
instruction pipeline;
performing the operation indicated by the
instruction; and
preventing any results, errors, traps or interrupts
generated by fetching or performing the operation
indicated by the instruction from being stored in
the computer system or affecting the operation of
the computer system.

2. An apparatus for permitting in a computer system
a first instruction, having a nullify signal with a
true and false state, to nullify a second instruction,
which with a write signal stores the results of the
second instruction into the computer or generates
errors, traps, and interrupts in the computer system,
the apparatus comprising:
means for retaining the state of the nullify signal
after the execution of the first instruction; and
means for qualifying the write signal of the second
instruction with the retained state of the nullify
signal in order to prevent results from the execution
of the instruction from being stored in the
computer system or any errors, traps or interrupts
from affecting the operation of the computer system.